Linear Upper Confidence Bound Algorithm for Contextual Bandit Problem with Piled Rewards
نویسندگان
چکیده
We study the contextual bandit problem with linear payoff function. In the traditional contextual bandit problem, the algorithm iteratively chooses an action based on the observed context, and immediately receives a reward for the chosen action. Motivated by a practical need in many applications, we study the design of algorithms under the piled-reward setting, where the rewards are received as a pile instead of immediately. We present how the Linear Upper Confidence Bound (LinUCB) algorithm for the traditional problem can be näıvely applied under the piled-reward setting, and prove its regret bound. Then, we extend LinUCB to a novel algorithm, called Linear Upper Confidence Bound with Pseudo Reward (LinUCBPR), which digests the observed contexts to choose actions more strategically before the piled rewards are received. We prove that LinUCBPR can match LinUCB in the regret bound under the piled-reward setting. Experiments on the artificial and real-world datasets demonstrate the strong performance of LinUCBPR in practice.
منابع مشابه
Pseudo-reward Algorithms for Contextual Bandits with Linear Payoff Functions
We study the contextual bandit problem with linear payoff functions, which is a generalization of the traditional multi-armed bandit problem. In the contextual bandit problem, the learner needs to iteratively select an action based on an observed context, and receives a linear score on only the selected action as the reward feedback. Motivated by the observation that better performance is achie...
متن کاملContextual Bandits with Linear Payoff Functions
In this paper we study the contextual bandit problem (also known as the multi-armed bandit problem with expert advice) for linear payoff functions. For T rounds, K actions, and d dimensional feature vectors, we prove an O (√ Td ln(KT ln(T )/δ) ) regret bound that holds with probability 1− δ for the simplest known (both conceptually and computationally) efficient upper confidence bound algorithm...
متن کاملCorrupt Bandits
We study a variant of the stochastic multi-armed bandit (MAB) problem in which the rewards are corrupted. In this framework, motivated by privacy preserving in online recommender systems, the goal is to maximize the sum of the (unobserved) rewards, based on the observation of transformation of these rewards through a stochastic corruption process with known parameters. We provide a lower bound ...
متن کاملContextual Bandits with Stochastic Experts
We consider the problem of contextual bandits with stochastic experts, which is a variation of the traditional stochastic contextual bandit with experts problem. In our problem setting, we assume access to a class of stochastic experts, where each expert is a conditional distribution over the arms given a context. We propose upper-confidence bound (UCB) algorithms for this problem, which employ...
متن کاملBandit Forest
To address the contextual bandit problem, we propose online decision tree algorithms. The analysis of proposed algorithms is based on the sample complexity needed to find the optimal decision stump. Then, the decision stumps are assembled in a decision tree, Bandit Tree, and in a random collection of decision trees, Bandit Forest. We show that the proposed algorithms are optimal up to a logarit...
متن کامل